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ESTIMATION IN DIRICHLET RANDOM EFFECTS MODELS 

By Minjung Kyung 1 , Jeff Gill 1 and George Casella 2 

University of Florida, Washington University and University of Florida 

We develop a new Gibbs sampler for a linear mixed model with 
a Dirichlet process random effect term, which is easily extended to a 
generalized linear mixed model with a probit link function. Our Gibbs 
sampler exploits the properties of the multinomial and Dirichlet dis- 
tributions, and is shown to be an improvement, in terms of operator 
norm and efficiency, over other commonly used MCMC algorithms. 
We also investigate methods for the estimation of the precision pa- 
rameter of the Dirichlet process, finding that maximum likelihood 
may not be desirable, but a posterior mode is a reasonable approach. 
Examples are given to show how these models perform on real data. 
Our results complement both the theoretical basis of the Dirichlet 
process nonparametric prior and the computational work that has 
been done to date. 

1. Introduction. Linear and generalized linear mixed models have be- 
come important statistical tools bringing more flexibility through the ad- 
dition of random effects to the more traditional linear models. A general 
mixed effects model can be written as 

(Yi, . . . , Y n ) ~ f(y 1 , . . . , y n \9, ip u . . . , tp n ) = JJ /(y;|<9, 

i 

ipi~G, i = l,...,n, 

where / and G are often taken to be normal. The addition of a link func- 
tion turns this into a generalized linear mixed model; we will discuss that 
below. The restriction of G to a normal distribution is sometimes thought 
to be limiting. For example, researchers in the social sciences are uncomfort- 
able with this assumption, wanting a more flexible, possibly nonparametric 
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structure here [Gill and Casella (2009)]. Another troublesome fact, as noted 
by Burr and Doss (2005), is that random effects, unlike error terms, cannot 
be checked (there are no residuals). Thus we are totally dependent on this 
uncheckable model assumption. 

One way of relaxing this assumption is with a richer, nonparametric model 
for ip with a popular alternative being the Dirichlet process 

ipi~W(m,<f)o), i = l,...,n, 

where T>V is the Dirichlet process with base measure <po and precision param- 
eter m. By moving to this model we not only relax the normal assumption, 
but also provide a richer model for the random effects. Such a model has 
potential for capturing more types of variability in those effects with the 
possible end result of more precise estimates of the fixed effects. Here we 
will look at ways to fit such models, focusing on methods of estimating the 
precision parameter m, and evaluating and improving on Gibbs samplers for 
the models. 

1.1. Background. Dirichlet process mixture models were introduced by 
Ferguson (1973) who defined the process and investigated basic properties. 
Blackwell and MacQueen (1973) showed that the marginal distribution of 
the Dirichlet process is equal to the distribution of the nth step of a Polya urn 
process. In particular, they proved that for tpi,...,ip n i.i.d. from G ^VV, 
the joint distribution of ?/> is a product of successive conditional distributions 
of the form 

m 1 

(i) vdV'i, - • • , ^ ~ - — 7- — M^i) + - — 7- — y2$(i>i = A), 

i — l + m i — l + m / —' 

l=i 

where 5 denotes the Dirac delta function. 

Other work that characterizes the properties of the Dirichlet process in- 
cludes Korwar and Hollander (1973) and Sethuraman (1994). Work that has 
particular importance for our development is that of Lo (1984), who derives 
the analytic form of a Bayesian density estimation that is generated by con- 
voluting a known density kernel with a Dirichlet process, and Liu (1996), 
who derives an identity for the profile likelihood estimator of to. 

The implementation of the Dirichlet process mixture model has been made 
feasible by modern methods of Bayesian computation and efficient algo- 
rithms. The work of Escobar and West (1995) and MacEachern and Muller 
(1998) developed estimation techniques and sampling algorithms, and Neal 
(2000) provided an extended and more efficient Gibbs sampler. 

Note that the representation (1) induces clusters in the random effects 
since with positive probability the value of ipi is equal to one of the previous 
values. McCullagh and Yang (2006) showed that the marginal distribution 
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of these Dirichlet clusters can be derived using cycles of integers and ex- 
changeability based on Pitman (1996), and an exchangeable cluster process 
can be generated by a standard Dirichlet allocation scheme. Quintana and 
Iglesias (2003) show that the joint marginal distribution of the Dirichlet 
observations can be expressed as a product partition model. Such models 
were introduced by Hartigan (1990) and Barry and Hartigan (1992), and 
are based on modeling random partitions of the sample space. 

It is interesting to note that with n observations from a Dirichlet pro- 
cess with precision parameter to, the marginal distribution of a partition 
{ni, 7i2, • • • j nk}, where j n j = n > n j ^ 1> is given by 

(2) 7r(ni,n2,...,nfc)= \> m k T\r(n j ), 

1 (to + nj 

which is a normalized probability distribution on the set of all partitions 
of n observations. This is the same distribution derived by McCullagh and 
Yang (2006) using cycles of integers (which can be related to the partitions) 
and has been used by other authors [Crowley (1997), Booth, Casella and 
Hobert (2008)] as a prior distribution on clusters for a Bayesian clustering 
algorithm. 

Most of theoretical and computational work for the Dirichlet process mix- 
ture models focus on the efficient estimation of 4>o. For the Dirichlet prior, 
we also need to consider the precision parameter to because it strongly in- 
fluences the number of distinct components, which is the distribution of the 
underlying random effects. The estimation of to has had difficulties in im- 
plementation due to computational intractability. The number of distinct 
components is not known in practice and it can be changed if new data are 
observed. Doss (2008) notes that the precision parameter is typically the 
most difficult to estimate or defend fixed value. 

The approach of Liu (1996), for the estimation of to, is to use sequential 
imputation to estimate to using maximum likelihood and treating the sub- 
clusters as missing data. Naskar and Das (2004, 2006) estimated m using 
Monte Carlo EM but did not investigate the properties of the solution. Our 
approach is based on using marginal or profile likelihood for to, and using 
the Gibbs sampler to estimate the model parameters. This is a variation 
of Casella (2001) who showed that, using an empirical Bayes approach, the 
hyperparameters can be estimated in a computationally feasible way. 

1.2. Summary. In this paper, for a mixed Dirichlet random effects model, 
we develop algorithms for estimation of the precision parameter and MCMC 
algorithms for fitting the models. We focus on linear models but also show 
how to extend our results to a generalized Dirichlet process mixed model 
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with a probit link function. In Section 2 we develop our model and its pro- 
bit extension using a new parameterization of the Dirichlet mixed model. 
Section 3 shows how to estimate the precision parameter, m. There we see 
that there is, in fact, not much information in the data about m (as it only 
depends on the size of the subclusters). We find that the MLE of Liu (1996) 
can be unstable and show how to obtain a more stable posterior mode esti- 
mate. Section 4 derives a Gibbs sampler for the model parameters and the 
subclusters of the Dirichlet process, and we use our new parameterization of 
the hierarchical model to derive a new Gibbs sampler that more fully exploits 
the structure of the model and mixes very well. We can adapt the results 
of Hobert and Marchev (2008) to establish that our sampler is an improve- 
ment, in terms of operator norm and efficiency, over other commonly used 
algorithms. Section 6 given details for the estimation of the precision param- 
eter, and Section 7 contains illustrative applications. Section 8 summarizes 
these contributions and adds some perspective, and there are a number of 
technical appendices. 

2. Models and likelihoods. A general random effects Dirichlet model can 
be written as 

(3) (Y 1 ,...,Y n )^f(y 1 ,...,y n \e^ 1 ,...,i; n ) = llf(y i \e,^ i ), 

i 

ipi ~ P?(m> ), i = l,...,n, 

where VP is the Dirichlet process with base measure 0o and precision pa- 
rameter m. The vector 9 contains all of the model parameters which we will 
address a bit later. 

Applying the Blackwell-MacQueen formula (1), we can calculate the like- 
lihood function, which by definition is integrated over the random effects, 
as 



where 



(4) 7r(Vi,...,Vn) = I 



i=i 



m + i — 1 



From Lo's (1984) Lemma 2 and following the development in Liu's (1996) 
Theorem 1, we can evaluate this integral to get 



C:\C\=kj=l 
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where C defines the subclusters, yy) is the vector of y^s that are in subclus- 
ter j and tpj is the common parameter for that subcluster. There are S Ut k 
different partitions C, the Stirling number of the second kind. 

A partition C clusters of the sample of size n into k groups, k = 1, . . . , n, 
and we call these "subclusters" since the grouping is done nonparametri- 
cally rather than on substantive criteria. That is, it is likely that any real 
underlying clusters would be broken up into multiple subclusters by the non- 
parametric fit since there is little penalty for over-separation. Recall that 
the regular GLMM assumes that the random effect tpiS are independent 
and identically distributed with the normal distribution iV(O,0"3,). However, 
the subclustering assigns different normal parameters across groups and the 
same parameters within groups; cases are i.i.d. only if they are assigned to 
the same subcluster. 

2.1. A matrix representation of subclusters. Each partition C can be 
associated with annxi matrix A defined by A' = (a'i, a' 2 , . . . ,a' n ) where 
is a 1 x vector of all zeros except for a 1 in one position that indicates 
which group the observation is from. Note that the column sums of A are 
(ni, ri2, • • • ,nk), the number of observations in the groups, and there are, 
of course, S n ^ such matrices. Specifically, if the partition C has groups 



, Sfc}, then if i E Sj, ipi 
Arj. For example, if S\ 



{Si,. 
as ij) 



(5) 



We then have 



rjj and the random effect can be rewritten 
: {3,4,6}, 5 2 = {1,2}, S 3 = {5}, 
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C:\C\=kj = l 



= E II r ^) f(y\e,Ar,)Mv)dv, 

AeA k j=i 

where Ak is the set of all matrices A and rjj ~ <fio, independent. If we define 



(6) f(y\0,A) = J f(y\6,Ar,)Mv)dv, 

the likelihood function is 

w ^(%) = f(^i> fe E n r te)/(yiM). 



k=l 



AcA k j=l 
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Note that if the integral in (6) can be done analytically, as will happen in 
the normal case discussed next, we have effectively eliminated the random 
effects from the likelihood, replacing them with the A matrices, which serve 
to group the observations. 

2.2. Marginalizing the random effects. Start with a normal linear model, 

Y\i/j~N(Xp + il>,a 2 I), 

where xj) = • • • , ipn)', ipi ~ T>V{m, N(0, r 2 )), i = l,...,n, independent. 
When we introduce the A matrices we get the models 

Y|T7,A~iV(X/3 + AT7,a 2 /), r)~N k (0,T 2 I), 

and marginally, Y is multivariate normal with 

EY = X/3, VarY = a 2 I + t 2 AA'. 

Moreover, in this model we can analytically marginalize out some of the 
model parameters. See Section 4. 

2.3. Consistency. We briefly discuss posterior consistency of these mod- 
els, noting that they have been extensively examined by Ghosal and co- 
authors [see, in particular, Ghosal, Ghosh and Ramamoorthi (1999) and 
Ghosal (2009)]. Ghosal (2009) defines a Mixture of Dirichlet Process (MDP), 
where the Dirichlet process is the error distribution, with possibly additional 
hyperparameters for the base measure. Alternatively, there is the Dirichlet 
Process Mixture (DPM) 3 where, for a parametric density, we assume there 
is a latent variable from a unknown and unrestricted distribution, modeled 
with a Dirichlet process prior. The famous inconsistency result of Diaconis 
and Freedman (1986), and the results of Doss (1985a, 1985b), are for the 
MDP model while our Dirichlet random effects model is an example of DPM 
with a normal density for the observations. 

Ghosal (2009) showed that for the MDP model, the conditions of the 
general consistency theory of Schwartz (1965) are not satisfied due to the 
discreteness of the resulting Dirichlet likelihood; thus the posterior from the 
MDP model can be inconsistent. However, the DPM likelihood is smooth 
enough to satisfy the conditions of Schwartz's theorem, and thus posterior 
consistency holds. Therefore, it follows that the posterior of our proposed 
Dirichlet random effects model is consistent. 

We conducted a small simulation study to examine the behavior of the 
posterior estimates of the coefficients in a linear DPM model. In this sim- 
ulation, we fix the number of subclusters, k, to be 20% of sample size, and 



3 It appears that the terms Mixture of Dirichlet Process and Dirichlet Process Mixture 
and not used unambiguously in the literature. 
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the concentration parameter m by using (18). Other parameters are fixed 
at a 2 = 1, t 2 = 4 and (3 = (/3 ,/3i)' = (1,2)'. From Table 1 we see that the 
standard deviations get smaller as sample size bigger, and the estimates are 
getting close to the true value as n gets bigger, illustrating the posterior 
consistency for the linear Dirichlet random effects models. 

2.4. A probit mixed Dirichlet random effects model. A generalized lin- 
ear mixed model (GLMM) can be specified to accommodate outcome vari- 
ables conditional on mixtures of possibly correlated random and fixed effects 
[Breslow and Clayton (1993)]. For example, assume that there is a Bernoulli 
selection process where we observe Yi according to 

(8) Y{ ~ Bernoulli (pi), i = l,...,n, 

where yi is 1 or 0; thus pi = E(Yi) is the probability of a success for the 
ith observation. Moreover, using a link function g(-), we can express the 
transformed mean of Yi as a linear function, 

(9) g(Pi) = XiP + il>u 

where Xj are covariates associated with the ith observation, (3 is the co- 
efficient vector, and ipi is a random effect accounting for subject-specific 
deviation from the underlying model. The ipiS are usually assumed to be 
distributed as N(0,a^). 

Variations of GLMMs were used by Dorazio et al. (2008) to model spatial 
heterogeneity in animal abundance, and Gill and Casella (2009) modeled 
political science data by using a GLMM with an ordered probit link. For 
Bayesian inference, Albert and Chib (1993) used a Gibbs sampler by intro- 
ducing a latent variable into the model and noted that the probit model on 
the binary response is connected with the normal linear model on a contin- 
uous latent data response. Mukhopadhyay and Gelfand (1997) used a fully 
Bayesian approach to fit generalized linear models with Dirichlet random ef- 
fects, and Ghosh et al. (1998) proposed hierarchical Bayes generalized linear 
models for longitudinal GLMMs in small area estimation, and provided a 
general theorem that ensures the propriety of posteriors under diffuse priors. 



Table 1 

Estimation of the coefficient parameters: Posterior means and standard deviations 



Condition 




m 


00 


/3i 


n = 100 and k = 


20 


7.246 


0.618 (0.114) 


1.945 (0.110) 


n = 500 and k = 


100 


37.318 


0.927 (0.046) 


1.986 (0.045) 


n = 1000 and k = 


= 200 


74.915 


1.051 (0.032) 


2.040 (0.033) 
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Although we can proceed to develop estimators with a general link func- 
tion (9), choosing g to be the probit link greatly simplifies things, which is 
what we will do. If we introduce the latent variable 

(10) Ui-NiXrf + ^a 2 ), 

then defining Yi = I{U% > 0) results in the probit model. Thus, if we consider 
the hierarchy to start with the C/j, we are in exactly the model of Section 
2.2. To fit the probit model we then add an additional step to the Gibbs 
sampler to generate the [/, conditional on the remaining parameters. This 
results in a truncated normal random variable generation, with details in 
Appendix A. 2. 

3. Estimating the precision parameter. In this section we look at the 
performance of the maximum likelihood estimates of m using both profile 
likelihood and marginal likelihood. We find that these estimates can be 
unstable, and hence suggest an alternative based on a posterior mode, which 
we ultimately implement with importance sampling in Section 6. 

3.1. Maximum likelihood estimates. From (7) define 

k 

(11) A(%) = E n r K-)/(y|^)- 

AeA k j=i 

Then, as in Liu (1996) or Doss (1994), we can obtain a profile likelihood 
estimate of m. That is, for each fixed value of we can differentiate the log 
of (7) and set it equal to to obtain the profile MLE as the solution to 

n2) Yl=ikm k - l C k {6\y) _^ 1 

1 ; ELi™ fc A(%) ^m + i-l 

which defines fh(6), the profile MLE. 

This development can be easily modified to obtain the marginal likeli- 
hood estimate of m, rh. Under the usual regularity conditions, when rh and 
rh{0) are both consistent estimates, we would expect that the marginal MLE 
would be a more stable estimate of the precision parameter. 

Starting from (11) we can also integrate over to obtain the marginal 
likelihood, 

(13) C k (y) = j C k {0\y)d0. 

The same development as above will lead to the expression (12) with C k {0\y) 
replaced by C k (y). 

We also note that expressions for the approximate variances of either rh 
or rh{0) can be easily attained from the second derivative of the appropriate 
log likelihood evaluated at the estimate. 



DIRICHLET RANDOM EFFECTS MODELS 



9 



3.2. Solving the likelihood equations. Note that, for either profile or marginal 
likelihood, we can write the log likelihood function as 

(n \ n 

^m fc c fc - \og{i - 1 + m), 
k=l J i=l 

where ct = Ck{0\y) for profile likelihood and Ck = £fc(y) for marginal likeli- 
hood. The derivative of the log-likelihood is given by 




dm Y^k=i m k Ck i — 1 + m 

It is straightforward to show that 

(16) lim -?-l(m) = — 

m->o dm c\ 

lim sgnf— — £(m) ] = sgn 
m->oo V dm I 

x ' \%=i / 

where sgn(-) is the sign of the function. Note that lim m _ i . 00 -£^£(m) = 0, 
but the direction of approach is important. The signs of the derivatives at 
the extremes are only functions of the ratios of the c^, the pieces of the 
likelihood. (We can assume, without loss of generality, that Ylk c k = !■) 

We would typically expect the sign of the derivative at m = to be pos- 
itive. Otherwise one of two cases could occur. If the derivative starts out 
negative and never changes sign, it will be the case that m = which im- 
plies that the Dirichlet model collapses back to the base model. If not, the 
derivative would cross from negative to positive, implying that a solution to 
the likelihood equations could result in a minimum, and thus we must exer- 
cise more care in finding the MLE. In either case, the sign of the derivative 
at m = depends on the values of the likelihoods with partition sizes k = 1 
and k = 2. 

We would also like the limiting sign as m — > oo to be negative, otherwise it 
could be the case that m = oo or, depending on the sign at 0, there could be 
multiple interior extrema. If c n is close to 0, then the limit can be positive. 
Note also that the first term of the sign of the derivative equals ( n 1 )( n 2 ) ; 
so for the sign of the derivative to negative, should be greater than 
(n-i)(n-2) ^ a j^jjjbgj. that grows rapidly with the sample size. Thus, the 
sign of the derivative at m = oo depends on the likelihoods for partition 
sizes k = n — 1 and k = n, and the sample size n. 
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(17) 



As an illustration of the possible shapes of the likelihood function, we 
consider a number of simple cases for n = 6: 

d 

c= (0,0,0,0,0, 1), 7— £(m) is t so m = oo; 

dm 

d 

c= (1,0,0,0,0,0), — £{m) is 1 so fh = 0; 

dm 

d 

c = (1, 1, 1, 1, 0, 0), — — £(m) is — I — so there is a minimum 

dm 

and maximum; 
d 

c= (1, 1, 0, 0, 1, 1), — — £(m) is — h so there is a unique minimum; 

dm 

d 

c = (1, 0, 0, 0, 0, 1), — — £{m) is H — so there is a unique maximum. 

dm 

Thus the likelihood function can have a variety of shapes, which are also 
illustrated in Figure 1. Moreover, we also note that the likelihood of m 
can be very flat, meaning that there is insensitivity to the value of the 
MLE. Referring to (15), the MLE actually corresponds to finding the m 
that solves this equation. Liu (1996) referred to this as the equating of 
prior and posterior means (where the expectation is taken using the discrete 
distribution with weights m k Cf c ) as the right-hand side of this equation can 
be interpreted as a prior number of clusters. For example, if we define k by 

n 

, , v - m 
18 k=> : — r, 

i=l 

then k is the expected number of prior clusters. Even though m can be quite 
variable, there is less variability in k = X^=i m+i-i • 

3.3. Posterior mode estimates. Since the likelihood of the precision pa- 
rameter depends on the likelihoods from the sub-cluster size k = 1,2, n — 1 
and n, this reflects an insensitivity to the likelihood and the sample size 
n. For example, when are equal for all k, lim m ^.o j|^( m ) > 0) but 
linim^oo J^£{m) > 0. Thus, in this case, we easily get the MLE as m = oo. 

Given the potential problems with using and MLE for m, we consider a 
prior on m that results in a unique value of the posterior mode. One of the 
candidates is a gamma distribution with the shape parameter a and scale 
parameter b. Using the prior g(m) = p(a)fe a m a ~ l e~ m l b we have 

(19) my) = ^^9{rn)jZrn k £ {[ T(nj)f(y\0, A). 
{ ' k=i AeA k j=i 
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Fig. 1. Log likelihood functions for a selection of configurations of component likelihoods 
given in (17). The vectors c are the configuration of the marginal likelihoods. 



If we now take logs and differentiate, this amounts to adding the factor 
^Er- — t to (15). The result of this is that the derivative of this log posterior 
increases from m = and decreases as m — > oo, guaranteeing an interior 
global maximum. If we integrate (19) we then get the marginal posterior for 
m, which behaves in a similar way. 

3.4. Simulation study of a linear Dirichlet mixed model. Using the nor- 
mal linear model of Section 2.2, we conducted a simulation study with a 
gamma prior, m ~ gamma(a, b) to study the behavior of the estimates of m. 
We take n = 6, k = 3, \i = 0, r 2 = 1 and /3 = (1,2, 3). The Dirichlet process 
on the random effect vj) has precision parameter m and base distribution 
Go = N(0,t 2 ). We simulated 100 data sets by generating X\ and X2 from 
iV(0, 1) using the fixed design matrix to generate Y. 
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In this setting, k = 1, . . . , 6, and from the calculation of the Stirling num- 
ber of the second kind, there are 1,31,90,65,15,1 possible subclusters, re- 
spectively. The matrices A associated with these subclusters can be gener- 
ated, and we sum up the likelihood with all possible subclusters for each k 
for the profile likelihood. 

We estimate m with various settings of the prior mean and variance. 
With n = 6, k = 3 and m = 5, the solution of m from equation (18) is 
m = 1.70. The numerical summary is given in Table 2 and the histogram 
of the estimated k is given in Figure 2. For the estimation of k, we use the 
posterior mean of m, m and calculate k by using equation (18). 

From Table 2, we observe that if the prior mean ab is close to fh = 1.70, 
we get a good estimate of k that is close to the fixed k = 3. However, if the 
prior variance is too big, then the estimate of k is less precise. Also, from 
Figure 2, we observe that the histogram of the estimated k with ab = 2 is 
almost symmetric at k = 3.10 with small variance. 

4. A Gibbs sampler for the model. We describe a general Gibbs sampling 
scheme that iteratively generates A matrices and then model parameters 
assuming that m is fixed at either the MLE or posterior mode. Details on 
the estimation of m are in Section 6. 

Start with the joint likelihood, 

(20) L(9,A\y) = ^ m J, g(rn)m k l[T(n 3 )f(y\e,A). 

With a flat prior on A and n(8), we get the joint posterior distribution as 

m k f(y\6,A)Tr(e) 



(21) *(M|y) 



Table 2 

For n — 6, k~3, m= 5 and various values of the prior parameters, we estimate the 
precision parameter m and its transformed value k. Standard errors are in parentheses 



Condition 


m 




ab = 2 and ab 2 = 10 


1.85 (0.06) 


3.10 (0.03) 


ab = 3 and ab 2 — 10 


2.56 (0.14) 


3.47 (0.07) 


ab = 4 and ab 2 = 10 


3.07 (0.31) 


3.67 (0.14) 


ab = 2 and ab 2 = 100 


1.98 (0.01) 


3.18 (0.01) 


ab = 3 and ab 2 = 100 


2.95 (0.02) 


3.63 (0.01) 


ab = 4 and ab 2 = 100 


3.90 (0.02) 


3.95 (0.01) 
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2.90 



2.95 



3.00 



3.10 



3.15 



Fig. 2. A histogram of the estimated k with prior mean 2 and variance 10. 

Based on (21), the full conditional posteriors of 9 and A are 

m k f(y\6,A)7T(0) f(y\0,A)n(9) 



<0\A,y) 
<A\9,y) 



Je m k f(y\9, A)tt(9) d9 f e f(y\0, A)x(9) d9 ' 
m k f(y\9,A)7r(0) m k f(y\9,A) 



J2 A ™ k f(y\e, AM9) y.a ™ k f(y\9, A) ' 

We now outline a Gibbs sampler that will generate from these conditionals 
by generating n x n A matrices and recovering the subcluster size through 
marginalization. 

For t = l,...,T, at iteration t: 

1. Starting from (0®,A®), 

(22) ~7r(0|A<*\y). 

2. Given 9^ +1 \ 

q (m) =(^ +1) ,...wi m) ) 

(23) 



~ Dirichlet(nf ) + ri, . . . ,n^' + r k ,r k+1 , . . . ,r„), 
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where n'i H \-n' n = n with k' of the n'j > 0. Sampling of the model param- 
eters 9 in (22) is straightforward (Appendix A.l), so we will concentrate on 
the sampling of A and q. 

The matrix A is n x n with column sums n\ , . . . , n n , and the columns 
with zero sums will be removed to obtain an n x k! matrix, according to 
Appendix B. Here we keep the rj as a general choice, but we will see in 
Section 5.2 and Appendix C.l that we should choose rj = 1 for all j. 

The transition kernel of this Markov chain is 

(24) k((8,A),(8',A'))=7r(0'\A,y) [ P(A'\q,e')f(q\A)dq 

JQ 

with 

m . »*/(y|^)(„,»„Jn^ 



£^*/(y|M)(„,?jn7 =1 <i 

and 



3 

J 



/(q|A)= J FTlll 9 / 11 ? i • 

Now we take rj = 1, and then we can express the multinomial as 

m-njll^ n ^ =1 r(n i + i)ii^ 

because the zero valued n^s take care of themselves. With this choice of rj, 
the transition kernel has Tr(9,A\y) as its stationary distribution; details are 
given in Appendix C.l. 



5. Generating the subclusters. In this section we discuss two aspects of 
generating the subclusters. First, we address how to generate according to 
(23). Then we examine convergence rates and establish that our sampler is 
an improvement, in terms of operator norm and efficiency, over commonly 
used algorithms. 

5.1. Generating the matrix A. Generation of the matrix A can be accom- 
plished by using a Gibbs sampler on the rows of A. Recall that <Zj, i = 1, . . . , n 

are the rows of A. Define A_i to be the matrix A with the ith row removed 

(£) 

and a\ to be a vector of zeros with a 1 in the £th position. The matrix 
{a\ , A_ t ) has column sums rip with k^ of rip > 0. Then for i = 1, . . . ,n, 

P(ai = a<V2) 
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where we update A_ t in the usual (Gibbs sampling) way. 

Alternatively, we can use a Metropolis-Hastings algorithm with a can- 
didate taken from a multinomial/Dirichlet as described in Appendix B.2. 
Based on the value of the qj in (23) we generate a candidate A from the 
multinomial and then remove the columns with column sum 0. That is, gen- 
erate an n x n matrix where each row is a multinomial, and the effective 
dimension of the matrix, the size of the subclusters, are the non-zero column 
sums. Deleting the columns with column sum zero is a marginalization of 
the multinomial distribution. The probability of the candidate follows from 
Appendix B.2, and the Metropolis-Hastings step is then done. 



5.2. Convergence properties. From (23), we see that given the subclus- 
ters, the sampling of the model parameters from ir(0\A,y) is straightforward. 
Thus, in investigating convergence we are only concerned with the conver- 
gence of the Markov chain on the subclusters. Clearly, if convergence is 
improved for this part of the chain it will then transfer to the entire chain. 

If we ignore the model parameters, then we are concerned with conver- 
gence of the chain to the stationary distribution (2), that is, 

(25) tt(A) = 7r(m, ...,n k ) = - ™ m k J] T( nj ), 

and first we derive the full conditionals in the following way. Start with 
(ni, . . . , n k ) with sum n — 1, and generate a new row of the A matrix. The ma- 
trix A is n x k, and when we generate a new row, either the dimension will 
remain n x k or we will increase to n X k + 1. If we write a = {aj}, then 

P( aj = 1, m, . . . , n k ) = - [ 2 m k T( nj + 1) TT T(n'j) for j = 1, . . . , k, 

1 ( n + m ) ±J - J 

j'=i 

i'ft 

P(aj = .. .,n k ) = r( ^ } m k+1 TT r(^) for j = k + l, 

I [n + m) 

j'=i 



P(ni, . . . , n k ) = =p£j£ X II r ^)' 
I (n — 1 + m) - LJ - 



k 
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This results in 

for j = l,...,k, 



n — 1 + m ' 

for j = k + 1, 



(26) P(aj = l\ni,...,n k ) = < 

, n — 1 + m ' 

which are exactly the full conditionals derived by Neal (2000), his equation 
(3.6) ignoring the model parameters, using a limit argument starting from a 
finite-dimensional Dirichlet. The Gibbs sampler based on (26) is the basis for 
most of the eight algorithms that he describes; some of which were originally 
developed by other authors. 

The Gibbs sampler resulting from (23), ignoring the model parameters, 

is 

(I»/r(n + m))m k n* =1 r(«i)( B1 "J IljLi If 



(27) 



TV i \ fe n 



and a similar argument shows that the full conditionals from this chain are 

f rij qj 



(28) P(aj = l\ni, . . . ,nk) oc < 



for j = l,...,k, 
7i j + 1 n — 1 + ?n 

m 

-gj, for j = k + 1, . . . ,n. 



, n — 1 + m 

Notice that for qj = rij + 1, j < k and qj = l,j > k (the normalization does 
not matter), we see that Neal's Gibbs sampler (26) is the same as (28). We 
can therefore write the transition kernel of (26) as 

K N (A,A') = P(A'\ q °)g(c l \A), 

where ^(q !^) is a point mass. In this notation, the kernel of (28) is 

K(A,A') = [ P(A'|q)/(q|q ) fl (q°|A)dq, 

JQ 

where /(q|q ) is the same as /(q|yl) in (27). The vector q° merely serves 
to pass the rij. 

We are now in the setup of Hobert and Marchev (2008) and can use their 
Theorem 3 to establish the superiority of K(A,A') over Kn(A,A'). 

Theorem 1. For the transition kernels Kn(A,A') and K(A,A'), both 
with stationary distribution given by (25): 

1. K(A,A') dominates K^{A, A') in operator norm; 
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2. K(A,A') dominates Kn(A,A') in the efficiency ordering of Mira and 
Geyer (1999) [see also Mira (2001)], which implies that, for any square- 
integrable function h, the asymptotic variance is smaller using K(A,A') 
than using K^(A,A'). 

Proof. In the terminology of Hobert and Marchev (2008), Kn(A,A') 
is in the form of a Data Augmentation (DA) algorithm, and K(A,A') is 
a parameter-expanded version of Kpj(A, A'). The theorem will be estab- 
lished if we can show that K(A,A') is reversible. This is straightforward as 
K(A,A') is, itself, a DA algorithm since K(A, A') = J Q P(A'\q)f(q\A) dq. 
To be specific, if we take r, = 1, then K (A, A') satisfies the detailed balance 
condition K(A, A')ir(A) = K(A', A)ir(A') (Appendix C.2). □ 

Therefore, in the estimation of any square integrable function h, using 
(28) will result in a smaller variance than obtained by using (26). 

5.3. Assessing the improvement. The results of Section 5.2 show that 
our sampler should mix better than "Stickbreaking" as defined by (26). 
Although we do not know the amount of potential improvement, the results 
of Roy and Hobert (2007) suggest that there are substantial gains to be had. 

To assess the amount of improvement of the Gibbs sampler, the following 
simulation study was done. For the linear Dirichlet mixed effects model 
described in Section 2.2 we simulated four data sets. For n = 100 we took A 
matrices corresponding to 1, 5, 25 and 100 groups in the data and ran 20 
Markov chains, each for 500 iterations. At each iteration we calculated the 
variance of the 20 cumulative means which are displayed in Figure 3. 

As can be seen, the improvement over the "Stickbreaking" algorithm can 
be quite substantial; in most cases we see almost a 50% percent improve- 
ment. Although we are not claiming that this will hold in all cases, we have 
a clear indication that substantial reduction in Monte Carlo variance can be 
attained. 

6. Importance sampling the precision parameter. To estimate the pre- 
cision parameter m we want to work with a marginal likelihood function in 
the form of (7). Based on the development in the previous sections, we start 
with the marginal posterior, 



(29) 




where 




k 
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100 200 300 400 500 100 200 300 400 500 

iterations Iterations 




Fig. 3. For n = 100, comparison of variance estimates using the "Stickbreaking" algo- 
rithm of (26) (S-B, dashed line) and the algorithm given in (28), the "Dirichlet Random 
Effects Model" (DREM, solid line). The four plots correspond to four underlying distri- 
butions of 1, 5, 25 and 100 groups. Twenty Markov chains were run, and the variance of 
the 20 estimates was calculated at each of the 500 iterations. 



To take advantage of this functional form for the estimation of m, we want 
to calculate fk(y) for each fe = 1, ...,n. However, this strategy is difficult 
to implement for a number of reasons. First, it would necessitate running a 
full Gibbs sampler (or other MC technique) for all k = 1, . . . ,n. Second, the 
implementation is problematic. For example, consider using an importance 
sampler based on simulating from the model 

e~f(y\0,A), 

(30) a,i ~ Multinomial(l, (q%, . . . ,<&)), independent, 

q= (qi,...,q k ) ~ Dirichlet (a, a), 
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which leads to the joint posterior distribution 

(31) M . /(y|M) _gL_/f[ 9r -- 1(iq 



3=1 * W " 3=1 
k 



Tin + ka) - LJ - Tla) 



where rij = ^i a «r Unfortunately, there is no guarantee that rij > 0, and 
samples with rij = will have to be discarded. 

However, we can proceed as in (23), and modify (30) to use an n-dimensional 
Dirichlet, 

q= (qi, ...,q n ) ~ Dirichlet(a, . . . ,a), 

and then generate Oj independently from this Dirichlet. We then eliminate 
from the A matrix all columns whose sum is zero. The resulting value of k 
has the distribution given in (31) because of the marginalization properties 
of the multinomial and Dirichlet. 

The simulation strategy is the following. For t = 1, . . . , T we generate 

A (t) 

according to (30) but using the n-dimensional Dirichlet, and then marginal- 
ize to the number of nonzero rij. We then gather the A® according to their 
values of k. Then, for each k, if there are matrices A of that size, we 
estimate fk by 

fk(j)= [ E Y[T(nj)f(y\0,A)ir(e)dB 

= y ffr T(nj)f(y\6,A)7r(e) 

A 7l k '°J=i f(y\ > A)w(9)(T(ka)/T(n + ka)) Uj=i^ n 3 + a )/ T ^)) 



r(n + ka)T(a) k 1 ^ y T{nf) 

3 

fk(y), 



«) 3ign r(n (t )+a) 



where we see very clearly that m only depends on the rij. We now use fk(y) 
in (29) to obtain the marginal MLE of m. 

We can further reuse these random variables for all k' < k by randomly 
choosing two columns and adding them together. This results in an A matrix 
of one fewer dimension. Details are given in Appendix B. 
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7. Application. In this section we use the Gibbs sampler for a general- 
ized linear mixed model with a Dirichlet process random effect term and 
probit link to analyze survey data from Scotland. On September 11, 1997, 
an overwhelming 74.3% of Scottish voters approved of the establishment of 
the first Scottish national parliament in nearly three hundred years, and 
on the same ballot the voters gave strong support, 63.5%, to granting this 
parliament taxation powers. This vote represents a watershed event in the 
modern history of Scotland which was a free and independent country sepa- 
rate from England until 1707. This vote is part of the Labour government's 
decentralization program and there is still uncertainty about the future role 
of Scottish government with the United Kingdom and the European Union. 
What we are interested in here are those who subsequently voted for the 
Conservative (Tory) party in Scotland and whether such a vote is intended 
to mitigate Labour's devolution program in Scotland. 

The data come from the British General Election Study, Scottish Elec- 
tion Survey, 1997 (ICPSR Study Number 2617). These data contain 880 valid 
cases, each from an interview with a Scottish national after the election. Our 
outcome variable of interest is their party choice in the UK general election 
for Parliament where we collapse all non-Conservative party choices (ab- 
stention, Labour, Liberal Democrat, Scottish National, Plaid Cymru, Green, 
Other, Referendum) to one category which produces 104 Conservative votes. 
The chosen explanatory variables are intended to explain this choice and in- 
clude two measures of political efficacy: POLITICS, which asks how much 
interest the respondent has in political events (increasing scale: none at all, 
not very much, some, quite a lot, a great deal), and READPAP, which asks 
about daily morning reading of the newspapers (yes = 1 or no = 0). It is also 
important to establish party identity separate from vote choice, PTYTHNK, 
and how strong that party affiliation is for the respondent (categorical by 
party name), IDSTRNG (increasing scale: not very strong, fairly strong, very 
strong). 

We also look at respondents' views on various policy issues. The variable 
TAXLESS asks if "it would be better if everyone paid less tax and had to 
pay more towards their own healthcare, schools and the like" (measured 
on a five point increasing Likert scale). DEATHPEN asks whether the UK 
should bring back the death penalty (measured on a five point increasing 
Likert scale). LORDS queries whether the House of Lords should be reformed 
(asked as remain as is coded as zero and change is needed coded as one). 
The question SCENGBEN asks how economic benefits are distributed between 
England and Scotland with the following choices: England benefits more 
= —1, neither/both lose = 0, Scotland benefits more = 1. The important 
question INDPAR asks which of the following represents the respondent's 
view on the role of the Scottish government in light of the new parliament: 
(1) Scotland should become independent, separate from the UK and the 
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European Union; (2) Scotland should become independent, separate from 
the UK but part of the European Union; (3) Scotland should remain part of 
the UK, with its own elected parliament which has some taxation powers; 
(4) Scotland should remain part of the UK, with its own elected parliament 
which has no taxation powers and (5) Scotland should remain part of the 
UK without an elected parliament. Relatedly, SC0TPREF1 asks, "should there 
be a Scottish parliament within the UK?" (yes = 1, no = 0). 

Finally, we use three demographic explanatory variables: RSEX, the re- 
spondent's sex, RAGE, the respondent's age, RS0CCLA2, the respondents social 
class (7 category ascending scale), TENURE1, whether the respondent rents 
(0) or owns (1) their household and a categorical variable for church affil- 
iation, measurement of religion is collapsed down to one for the dominant 
historical religion of Scotland (Church of Scotland/Presbyterian) and zero 
otherwise and designated PRESB. 

We set a 2 = 1 to establish the scale and provide an intuitive (standard) 
probit metric. This decision appears to have little influence on the resulting 
posteriors and allows the ip specification sufficient latitude to draw nonpara- 
metric information from the data. The parameters in the priors on \i and r 2 
are chosen to make the priors sufficiently diffuse to allow the random effect 
to do its work. In previous work [Gill and Casella (2009)], we observed little 
sensitivity to hyperparameter values. 

We ran the Gibbs sampler for 5000 iterations disposing of the first 2000. 
All of the common diagnostics (Geweke, Brooks-Gelman-Rubin, Heidelberger- 
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Table 3 

Posterior model quantiles, voting model 
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— 1.32 
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— 0.48 
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POLITICS 


0.12 


0.17 


0.23 


0.29 


0.44 


READPAP 


0.05 


0.17 


0.31 


0.45 


0.78 


PTYTHNK 


-0.81 


-0.75 


-0.68 


-0.62 


-0.45 


IDSTRNG 


0.14 


0.19 


0.25 


0.31 


0.46 


TAXLESS 


0.02 


0.07 


0.13 


0.19 


0.32 


DEATHPEN 


0.01 


0.05 


0.09 


0.14 


0.24 


LORDS 


-0.73 


-0.61 


-0.48 


-0.37 


-0.07 


SCENGBEN 


0.22 


0.30 


0.39 


0.47 


0.67 


SC0PREF1 


-1.41 


-1.29 


-1.14 


-1.00 


-0.65 


RSEX 


0.20 


0.30 


0.43 


0.56 


0.84 


RAGE 


0.01 


0.01 


0.01 


0.02 


0.03 


RS0CCLA2 


-0.23 


-0.19 


-0.15 


-0.10 


0.00 


TENURE 1 


-0.23 


-0.19 


-0.15 


-0.10 


0.02 


PRESB 


-0.41 


-0.29 


-0.17 


-0.04 


0.22 


INDPAR 


0.02 


0.15 


0.29 


0.44 


0.77 



Welsh, graphics) point toward convergence of the Markov chain to its sta- 
tionary distribution. Figure 4 is a cumulative mean plot for each of the 
dimensions for the entire t = 5000 period of the chain. 

Table 3 provides quantiles for the posterior marginal distributions. We 
observe that an interest in politics and regular reading of the newspapers 
increases the probability of voting Conservative as does (not surprisingly) 
supporting less taxes and the return of the death penalty. We see the same 
positive effect for men versus women, older versus younger and homeowners 
versus renters. Those with stronger party attachments are also more likely 
to vote for the Conservative party. Reforming the House of Lords, Presby- 
terians and those affiliated with more liberal parties are less likely to vote 
Conservative. 

Two results are surprising. First, those that think that economic policies 
benefit Scotland more than England are more likely to vote for the Conser- 
vative party which is much more aligned with English voters than Scottish 
voters in general. Perhaps there is a sense that Conservative voting brings 
attention to Scottish issues from the party. More surprisingly, favoring an 
independent party is positively associated with voting Conservative through 
the model. This was our key variable of interest and the relationship is not 
in the direction expected. The Conservative party is not generally favorable 
to devolution issues, so these voters are clearly cross-pressured. It is impor- 
tant to keep in mind that the new parliament has taxation powers and thus 
diminishes the power of local council authorities who are overwhelmingly 
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Fig. 5. Comparison of 90% credible intervals for the Dirichlet random effects model 
(black) to those from a normal random effects model (blue). 

associated with the Labour and Scottish National parties. Thus a Conserva- 
tive voter may welcome a more centralized taxation program with possibly 
less influence from these parties, at least at the local level. 

In terms of model fit, notice that, aside from the constant, only two 
marginal posteriors do not have 90% HPD intervals bounded away from 
zero. It turns out that by every common measure of fit the generalized lin- 
ear mixed model with a Dirichlet process random effect term outperforms 
a simple Bayesian probit model with diffuse uniform prior distributions on 
the parameters. Indeed, when we compare the lengths of credible intervals 
in Figure 5, we find that the Dirichlet model results in uniformly shorter 
intervals than those of a normal random effects model. Thus as we antici- 
pated in Section 1, the richer random effects model is able to remove more 
extraneous variability resulting in tighter credible intervals. We take this as 
evidence that the new procedure is capturing nonparametric information of 
interest. 
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8. Discussion. Our interest in models with Dirichlet random effects grew 
from modeling social science data, where scientists expressed concern over 
Bayesian models that used informative priors. The Dirichlet process ran- 
dom effects model helps to balance the information from the data and the 
belief of the researcher while still allowing a normal-type interpretation (in 
terms of means and variances). As noted previously, we are in agreement 
with the sentiment of Burr and Doss (2005), who note that random effects, 
unlike error terms, can not be checked (there are no residuals). Thus a model 
with normal random effects is a model of convenience, and moving to a richer 
model such as the Dirichlet process is a step in relaxing unverifiable assump- 
tions. In particular, the subclustering structure of the Dirichlet process may 
capture extra variation in the random effects that escape the normal random 
effects models. The fact that data analysis with the Dirichlet random effect 
model often differs substantially from the normal random effects model, as 
noted in Section 7, supports this claim. 

Representing the subclustering structure through the symmetric binary 
matrix A is not new. For example, such an equivalence representation was 
noted by McCullagh and Yang (2006). Here, the representation has proved 
useful not only in deriving alternative forms of the model but also in lead- 
ing to an improved Gibbs sampler. The influence of the random effects, as 
modeled with the parameters ip and r], is only felt through the matrix A, 
and in some cases these parameters may not have to be generated (see the 
representation in Section 2.2). This again will lead to a more efficient Gibbs 
sampler. 

The improvement in the Gibbs sampler, as described in Sections 5.2 and 
5.3, appears to come with an increase in computational effort, as we want 
to start each iteration with an n x n matrix A. However, due to the binary 
structure of A, such a matrix need never be generated. In particular, we 
can use the correspondence between a multinomial random variable and a 
discrete random variable to represent the n x n matrix A as an n x 1 vector 
a. If X ~ Multinomial(l, (pi, . . . ,p n )), we create a discrete random variable 
X* satisfying P(X* = j) =Pj. We then use X* to generate the rows of A. 
For example, if n = 6 and six samples of X* gives the vector (2, 2, 1,1,3,1), 
this represents a matrix A with row 1 having a 1 in column 2, row 2 having 
a 1 in column 2, etc., with the full matrix being the matrix A in (5). 

We started this project to investigate generalized linear models with 
Dirichlet random effects but quickly realized that dealing with m is of prime 
importance and concentrated on linear models to better understand the es- 
timation. As we have seen, ordinary likelihood could be problematic which 
may be a result of the fact that there is really very little information about m 
coming from the data. As we saw in Section 3, the information in the model 
about m is only contained in the subclusters, which makes it relatively im- 
portant to check that the results of the model as somewhat insensitive to 
the value of the estimated m. 
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Finally, we note that although we have concentrated on linear and pro- 
bit models, the results will apply directly to a wider class of generalized 
linear models. There are implementation problems with the Gibbs sampler 
that arise with models such as the logit, where one needs to use either a 
slice sampler, a Metropolis-Hastings step or a demarginalization with the 
Kolmogorov-Smirnov distribution [Andrews and Mallows (1974)]. We have 
looked at these implementations in Kyung, Gill and Casella (2009). However, 
these are all variations on the model and, when any general link function 
such as (9) is used, the Gibbs sampler for A and the estimation of m will 
remain the same. 

APPENDIX A: GENERATING THE MODEL PARAMETERS 
A.l. A linear model. For given A, the likelihood function is given by 

We add the following normal and inverted gamma (IG) priors: 

P\a 2 ~ N(0,a 2 I), 

h), 

a 2 ~ IG(a 2 , b 2 ). 

Then for fixed m and A, with A* = —%I + -^A'A, a Gibbs sampler of 
(/3,a 2 ,T 2 ,rj) is 

V \P, a 2 ,r 2 ,A,y~N k (Ia^A^j - X?) , A*~ l 

(3\a 2 ,T 2 , V ,A,y~N p ((I + X'Xr 1 X'(y-Ar J ),a 2 (I + X'X)- 1 ), 

( k 1 
T 2 \f3,a 2 ,T ] ,A,y^lGl- + a 1 ,-\T]\ 2 + bi 

a 2 1/3, r 2 , V , A, y ~ IG + a 2 , \ \y - Xf3 - A V \ 2 + \ |/3| 2 + b 2 \ . 

If we marginalize out T7, the joint posterior distribution of (f3,a 2 ,T 2 ) is 
7r fe (/3, a 2 ,r 2 \A, y) = j vr fc (/3, a 2 ,T 2 ,rj\A, y) drj 

1 \ (n+p)/2+a 2 +l / -t \ fe/2+ai+l 

1 N / 1 \ ^ m 2 /2+b2)/a 2 _ fel/r 2 



(33) r 2 ~IG(ai,&i 



X |^*|l/2 e -(y-^/3)'[l-A(A*)- 1 A'/ CT 2 ](y-X/3)/(2a 2 ) 

which leads to an alternate Gibbs sampler. 
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A. 2. A probit model. Here we need to consider the latent variable U 
such that 

(34) U i =X i p + il> i + TH, r)i~N(0,<T 2 ), 

and 

Jl, if C7i > 0, 

It can be shown that Y{ are independent Bernoulli random variables with 
the probability of success, pi = Q( Xi/3 ~^' ) where $ is the cdf of the standard 
Normal. 

For given A, the likelihood function of model parameters and the latent 
variable is given by 

L k (p,a 2 ,T 2 ,rj,U\A,y) 

n 

= Him > WiVi = 1) + I(Ui < 0)I( yi = 0)} 



i=l 



1 



n/2 / -. \ fc/2 



2na 2 

where U = (U±, . . . , U n ) and leads to the Gibbs sampler 
77|/3, a 2 , t 2 , U, A, y ~ N k ( \a*~ 1 A\\] - X0), A*~ l 



e V-xp-A v [,^ e 



V 2vrr 2 J 



/3|cr 2 , r 2 , r?, A,y- JVp((/ + X'X^X'W - Arj),a 2 (I + X'xy 1 ), 
T 2 \P,a 2 ,r),A,y ~ IgQ + ai, ^/j] 2 + 6 X , 



r ,?7, A,y ~ IG f + a 2 , - |U - Xp - Ar,\' + -|/3| 2 + b 2 J , 

for the model parameters. For the latent variable U, for i = 1, .. . , n, 
[7*1/3, r 2 , a 2 , ry, A, w ~ iV(A^/3 + (^) <9 a 2 )/^ > 0) if Vi = 1, 
[7*1/3, r 2 , a 2 , r/, A, Vi ~ iV(A^/3 + (^) <9 <r 2 )I([7* < 0) if ^ = 0. 
Here, we can marginalize out rj such that 

L k (P,a 2 ,T 2 ,U\A,y) = J L k (P,a 2 ,t 2 ,rj,V\A,y) drj 

n 

= > °) J (^ = X ) + 7 ^ ^ = °)> 

i=l 
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4*11/2 



X 



( 2vr(J 2)n/2( T 2)fc/2 

p -(U-X/3)'[/-A(A*)-M'/ ( t 2 ](U-X/3)/(2 ( 7 2 ) 



x e 



where A* 




4'A 



APPENDIX B: MARGINAL DENSITIES 



In this appendix we give the details for the calculation of marginal den- 
sities of the Dirichlet, multinomial and their mixture. 

B.l. Dirichlet. Starting with an n-dimensional Dirichlet distribution, 
the marginal distribution of any k components is also Dirichlet. This corre- 
sponds to extracting the rows with non zero column sums in the A matrix 
in (23). That is, if (qi, . . . ,q n ) ~ Dirichlet(ri, . . . ,r n ), then for k <n 



as given in Ferguson (1973). 

A special case of this result is the combining of two rows which is the 
marginalization that we use in the calculation of the estimate of m (see 
Section 6). 

If q = (qi, . . . , q n ) ~ Dirichlet (ri, . . . , r n ), then for any k and k + 1 < n 



B.2. Multinomial. For (X\, . . . , X n ) ~ Multinomial(l, (q±, . . . , q n )), marginal- 
ization of the AjS is compatible with the Dirichlet results of the previous 
section. That is, 
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We also have a similar result for the combining of two elements of the 
vector, that is, if (X\, . . . ,X n ) ~ Multinomial(l, (qi, . . . ,q n )), then 

I X\, . . .,Xk-i, Xj , Xk + 2 , ■ ■ ■ ,x n J 

V j=k J 

~ Multinomial 1, I qi, . . . , q k ^, 1 - ^ 

V V j^k,k+l J J 

B.3. Multinomial-Dirichlet. Lastly, we see that these marginalization 
patterns persist when we combine the multinomial and Dirichlet. Let the 
matrix A nxn have each row be an independent multinomial as follows: 

(an,. . . ,a in ) ~Multinomial(l,(gi,...,g n )), i = l,...,n, 

(gi, . . . , q n ) ~ Dirichlet (n , . . . , r n ) , 



and then create the matrix A* by adding together rows k + 1, ... ,n, marginal- 
izing (qi, . . . , q n ) in the same way. Then 



P(A*) = j P(A|q)/(q)dq 



where rij = Y,i a iji 1> • • • > k ~ 1 and n k = J2i Y!j=k a ij- 

If we add together rows k and k + 1 in the matrix A to obtain A* , a 
similar result holds: 



r[jyfc,fc+i r ( r i) r (Si=fc r i 



r(n + E- = i^-) 

APPENDIX C: PROPERTIES OF THE MARKOV CHAIN 

C.l. Stationary distributions of (6, A). From the transition kernel of 
1,A) in (24), 

V f K((9, A), (9', A'))tt(9, A\y) d9 
A J& 

= V / n(9'\A,y) [ P(A'\q,9')f(q\A)dqn(9,A\y)d9 
a J@ JQ 
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rn k f(y\e\A')( ni - nk/ )UU^ 



£ A mV(y|M)( B1 .?JIlSLiff?'J 
V >f e f(y\8',A>Me')de 



m k f(y\0,A)7r(0) 



d6 dq, 



and note that the integral cancels J & f(y\6' ,A')tt(9') dO inside the sum 
over A. So the integral becomes 



m 



k f(y\o',A')( 



)U k ' a n[ 



E A ™ k f(q\A)f(y\e>,A)n(8') 



.T, A m k me,A)( ni ^JuU^ 

Now take (3j = 1 for all j = 1, . . . , n so that 

v[2n) k 



TT< 



£ A Je/(y|MM0)«# 

r(2n) 



This cancels out the denominator sum to leave 

r(2n) J Q ^(yl^>(^LA fc ,) nUtf d * 

nl EAfem k f(y\e,A)Tr(9)d6 
and evaluating the integral over q gives 



T(2n) 



ni---n k 



\n^ i £K + i) = 

■J ) T(2n) 



where we do the n dimensional integral with q)j for j > k' . So 



W K((e,A),(e',A'))K(e,A\y)de = = 



m k f(y\e',A')n(e') 



A J e m k f(y\e,A)7r(0)d9 
<0',A'\y). 



C.2. Detailed balance. We have from (27) 

(I»/r(n + m))m k ' T\)' =l r(n$) ( n ,.^, ) UU *i 



K(A,A')tt(A) 



T, A (T(n)/T(n + m))m* \\ k =l I>,) ( n ^J \\ k =1 % 
k 



r(2n) 



30 



M. KYUNG, J. GILL AND G. CASELLA 



x 




/ 



r (T(n)/T(n + m))m k U-=i r K) L "„ J IljU Qj 



Q L (T(n)/r(n + m))m* n J=i r(n,) ( Bl .?J lg=i 9? J 



x 



n-=irK + i) 



r(2n) 




5=1 




fc' 



X 



r(n + m) 



*n r K) ^ 



if(A',A)7r(A'). 
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